NESM: a Named Entity based Proximity Measure for Multilingual News Clustering

Authors

  • Soto Montalvo
  • Víctor Fresno-Fernández
  • Raquel Martínez-Unanue
Abstract

Measuring the similarity between documents is an essential task in Document Clustering. This paper presents a new metric that is based on the number and the category of the Named Entities shared between news documents. Three different feature-weighting functions and two standard similarity measures were used to evaluate the quality of the proposed measure in multilingual news clustering. The results, obtained on three different collections of comparable news written in English and Spanish, indicate that the new metric in some cases performs better than standard similarity measures such as cosine similarity and the correlation coefficient.
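As an illustration of the two baseline measures named in the abstract, here is a minimal Python sketch of cosine similarity and the (Pearson) correlation coefficient over sparse term-weight vectors. The abstract does not give the NESM formula, so it is not reproduced here; the helper `shared_entities_by_category` is a hypothetical function that only computes the raw ingredients the abstract mentions (how many named entities two documents share, and of which category). The dictionary representation, the category labels, and the choice to compute the correlation over the union of both documents' features are all assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def correlation_coefficient(a, b):
    """Pearson correlation, computed over the union of both feature sets.

    Assumption: features absent from both documents are ignored.
    """
    features = sorted(set(a) | set(b))
    n = len(features)
    if n == 0:
        return 0.0
    xs = [a.get(f, 0.0) for f in features]
    ys = [b.get(f, 0.0) for f in features]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def shared_entities_by_category(doc_a, doc_b):
    """Hypothetical helper: count shared named entities per category.

    Each document is a set of (entity, category) pairs. Entities are
    matched by exact string equality, which is a simplification.
    """
    shared = set(doc_a) & set(doc_b)
    return Counter(category for _, category in shared)


# Toy documents (invented for illustration only).
vec_en = {"madrid": 1.0, "obama": 2.0}
vec_es = {"madrid": 1.0, "obama": 2.0}
ents_en = {("Madrid", "LOCATION"), ("Obama", "PERSON"), ("NATO", "ORGANIZATION")}
ents_es = {("Madrid", "LOCATION"), ("Obama", "PERSON"), ("OTAN", "ORGANIZATION")}

cos = cosine_similarity(vec_en, vec_es)
corr = correlation_coefficient(vec_en, vec_es)
shared = shared_entities_by_category(ents_en, ents_es)
```

In this toy example "NATO" and "OTAN" are not matched, because the sketch compares entity strings exactly; a multilingual setting would need some form of cross-lingual entity matching, which is outside what this sketch attempts.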


Related articles

Multilingual News Document Clustering: Two Algorithms Based on Cognate Named Entities

This paper presents an approach for Multilingual News Document Clustering in comparable corpora. We have implemented two algorithms of heuristic nature that follow the approach. They use as unique evidence for clustering the identification of cognate named entities between both sides of the comparable corpora. In addition, no information about the right number of clusters has to be provided to ...


A Cluster-based Approach to Broadcast News

We present an approach to detection and tracking of topics in multilingual broadcast news based upon a dynamic clustering scheme. Our approach derives from a system used to filter Web searches from multiple sources, with extensions for pipelining document clusters, part-of-speech tagging and extraction of named entities for use in an extended similarity measure.


PAYMA: A Tagged Corpus of Persian Named Entities

The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...


Bilingual News Clustering Using Named Entities and Fuzzy Similarity

This paper is focused on discovering bilingual news clusters in a comparable corpus. Particularly, we deal with the news representation and with the calculation of the similarity between documents. We use as representative features of the news the cognate named entities they contain. One of our main goals consists of proving whether the use of only named entities is a good source of knowledge f...


Multilingual Topic Detection Using a Parallel Corpus

We have developed an approach for topic detection from multilingual news, in particular Chinese and English. We extract named entities such as people names, geographical location names, and organization names automatically from the news content by transformation-based linguistic taggers. These sets of named entities together with the remaining content terms form the basis of news representation...


Journal:
  • Procesamiento del Lenguaje Natural

Volume 48

Publication date: 2012